Data Labeling Startups funded by Y Combinator (YC) 2026

May 2026

Browse 18 of the top Data Labeling startups funded by Y Combinator.

We also have a Startup Directory where you can search through over 5,000 companies.

Hub.xyz
P2026
• Active • 10 employees • Palo Alto, CA, USA

ai
artificial-intelligence
big-data
data-labeling

Shofo
W2026
• Active • 4 employees • San Francisco
We are building the world’s largest video library. We've aggregated billions of videos into a searchable index and use agents to find and label the exact datasets a lab needs on demand. If a lab needs 100K hours of cooking videos where someone is holding a pan, with reasoning annotations on top, our agents search the index, extract the matching subset, route it through our labeling pipeline, and deliver a custom dataset in days, not months.
data-labeling
artificial-intelligence
infrastructure

Datoric
W2026
• Active • 2 employees • San Francisco
data-labeling
ml
ai

Sciloop
F2025
• Active • 2 employees • San Francisco, CA, USA
Sciloop creates expert-level math and physics problems that frontier AI models can't solve, then sells the data to AI labs for training and evaluation. Our problems are created by IPhO and IMO medalists — the top 0.01% of STEM talent globally. On our benchmark, models like GPT 5.4 Pro and Gemini 3.1 Pro score 0-5% on our hardest problems. We work with AI labs to supply continuous, fresh training data that pushes the frontier of mathematical and scientific reasoning. Founded by Bilal and Osman, International Physics Olympiad medalists from MIT with hands-on ML research experience at MIT CSAIL.
artificial-intelligence
big-data
data-labeling
marketplace

Panels
S2025
• Active • 2 employees • San Francisco, CA, USA
Panels is an audio data platform that delivers high-quality speech datasets from vetted, diverse contributors to power training and evaluation of foundational voice models.
data-labeling
b2b
artificial-intelligence

Liva AI
S2025
• Active • 2 employees • San Francisco, CA, USA
Speech models trained on internet data still produce unrealistic results. We solve this by collecting targeted training data for model labs. We hope to create a world where AI feels more human.
b2b
big-data
data-labeling
marketplace
artificial-intelligence

Besimple AI
P2025
• Active • 6 employees • San Francisco
We are building the data layer for AI, starting with audio. We start with data collection, curating our own proprietary set of diverse conversational data covering a wide range of languages, dialects and accents. We then leverage human expert audio annotators and our own annotation platform to process audio data for Automatic Speech Recognition. With human-level transcription and diarization, our data help push the audio model frontier. Today we have millions of hours of conversational data, and the collection is growing. If you need audio data for training or evaluating your voice models or voice agents, reach out! We offer flexible licensing deals that work for startups and enterprises, with minimal process. Audio data should besimple :)
artificial-intelligence
data-labeling
aiops

Cartpole
P2025
• Active • 1 employee • San Francisco, CA, USA
We're creating reinforcement learning environments for training frontier models.
reinforcement-learning
ml
ai
data-labeling

Sureform
P2025
• Active • 2 employees • Palo Alto, CA, USA
We collect high-quality human data, across diverse interactions and environments, to help advance the next generation of multimodal AI and robotics models.
data-labeling
robotics
marketplace
artificial-intelligence

AfterQuery
W2025
• Active • 30 employees • San Francisco, CA, USA
AfterQuery is an applied research lab curating data solutions for frontier foundation model development. Serving every frontier AI lab.
b2b
artificial-intelligence
ai
big-data
data-labeling

Unbound
S2024
• Active • 7 employees • San Francisco, CA, USA
cybersecurity
artificial-intelligence
privacy
data-labeling

Sieve
W2022
• Active • 18 employees • San Francisco, CA, USA
Sieve is the only AI research lab exclusively focused on video data. Video already makes up 80% of internet traffic and has become the dominant medium driving creativity, communication, gaming, AR/VR, and robotics. Unlocking the ability to truly model video is the key to breakthroughs across all of these domains, but progress has been bottlenecked by one thing: high-quality training data. That’s where Sieve comes in. We bring together exabyte-scale video infrastructure, novel video understanding techniques, and dozens of diverse data sources to create datasets that push the frontier of video modeling. This unique combination allows us to deliver data with unmatched precision, quality, and speed, which has earned the trust of frontier AI labs, Fortune 100 companies, and fast-growing generative AI startups.
video
developer-tools
ai
data-engineering
data-labeling

Spade
W2022
• Active • 25 employees • New York, NY, USA
Spade is the next generation of fintech infrastructure. We’re building a financial data enrichment API purpose-built to empower our customers to uncover the truth hidden within their transaction data. We use our vast, ground-truth merchant data set to decipher cryptic transactions, helping customers underwrite, detect fraud, build better banking infrastructure, and get a unique understanding of their users’ spending habits.
fintech
machine-learning
payments
data-labeling
ai

Lightly
S2021
• Active • 5 employees • Zürich, Switzerland
When ML teams send their data to companies like Scale.ai for labeling, most can only afford to label 1% or less of their datasets. But today they don’t have a good way to pick which 1% to label. We help them pick the best 1% of their data to label. By labeling the most representative data, they significantly improve model accuracy at the same cost.
machine-learning
data-labeling

Centaur
W2019
• Active • 45 employees • Boston, MA, USA
The best AI models aren’t just trained and evaluated with human data; they’re built with superhuman data. The strongest datasets emerge through collective intelligence, where humans and machines work together to outperform either one alone. At Centaur, we create superior quality data by turning annotation into an arena where experts and AI compete.
data-labeling
crowdsourcing
data-science
artificial-intelligence

Sepal AI
S2024
• Acquired • 15 employees • San Francisco, CA, USA
Sepal is a data research company on a mission to advance human knowledge and capabilities through safe AI. We partner with the world’s leading AI labs and enterprises to help their models get better at the tasks people actually want them to do. We’ve built a Cloud-Native Agent Dataset Factory, which turns the process of generating evaluation and training data from something manual, inconsistent, and labor-intensive into something automated, standardized, and scalable. Sepal AI was founded in 2024 by engineers and operators from Vercel and Turing. We went through Y Combinator, raised several million dollars from leading investors, and already count multiple Fortune 500s and top AI research labs as paying customers.
data-labeling
aiops
reinforcement-learning
ai

Deasy Labs
S2023
• Acquired • 8 employees • New York City
Deasy Labs was acquired in July 2025 by Collibra, a global leader in enterprise data governance. Deasy Labs provides metadata orchestration for AI workflows. Its platform provides the best way for AI teams to create and embed high-quality, customized metadata into their AI workflows (e.g., RAG, Agentic frameworks). Our three founders (from Amazon, McKinsey/QuantumBlack & MIT) previously built an ML data governance tool from 0 to 1 within McKinsey, which we deployed with 11 Fortune 500 companies. In early 2023 we saw that the ability to create high-quality metadata (without reliance on domain experts) would be a key factor in achieving the accuracy and speed that GenAI applications require in production. Our investors include General Catalyst, Y Combinator, RTP Global and world experts in enterprise data. Website: https://deasylabs.com
ai-assistant
data-labeling
databases
big-data
artificial-intelligence

JumpWire
W2022
• Acquired • 2 employees • New York, NY, USA
JumpWire is a data protection platform that adds advanced data security controls between APIs, applications and databases. JumpWire automatically identifies sensitive properties inside large data sets and gives developers full control over which people and applications can access or update records containing sensitive info. Example uses include restricting who can read customer PII to members of the customer service team, giving on-call engineers elevated access to production, or splitting user records between regions for GDPR purposes. JumpWire’s approach to securing data in place minimizes the risk of data leaks exposing sensitive information or of mishandling by other applications and vendors. The exact security scheme applied to data is defined by policies that align with an organization’s existing InfoSec program. JumpWire helps companies that maintain information security through compliance programs such as SOC or HIPAA. These companies process sensitive data, often from their own customers, and exceed security best practices as a competitive advantage. JumpWire provides defense in depth for data and sits alongside access controls and Layer 4 encryption to provide a comprehensive data security solution. JumpWire differs from solutions such as data vaults in that it installs inside our customers’ own infrastructure and clouds. It is interoperable with existing applications and databases, which eliminates the need for large data migrations or code refactoring. Lower-level approaches to data security, such as encryption at rest, are too blunt and cannot differentiate between properties in the data itself. Their scope is limited to physical storage, and security is lost as soon as an application or query loads the data.
security
data-labeling
databases

